Case Fatality Rate (CFR) of COVID-19

By May 2020, more than 1 million confirmed COVID-19 cases and 60,000 deaths had been reported in the United States, according to the Johns Hopkins University Coronavirus Resource Center. Early identification of the factors most strongly associated with COVID-19 deaths at the U.S. county level can inform lifestyle changes for high-risk patients and the distribution of public resources, and in turn help reduce the CFR. This study explores the health factors most relevant to COVID-19 deaths and predicts the overall risk of death using logistic regression.

The risk of dying from COVID-19 is measured by the case fatality rate (CFR), defined as \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}} \]
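The definition above can be illustrated with a minimal Python sketch. The function name `case_fatality_rate` and the input checks are assumptions made for illustration. The county counts are copied from the highest-CFR table in the Descriptive Data Visualization section.

```python
def case_fatality_rate(deaths: int, confirmed: int) -> float:
    """Return deaths / confirmed cases (hypothetical helper, not from the report)."""
    if confirmed <= 0:
        raise ValueError("CFR is undefined without confirmed cases")
    if deaths < 0 or deaths > confirmed:
        raise ValueError("deaths must lie between 0 and the number of confirmed cases")
    return deaths / confirmed


# (deaths, confirmed) pairs copied from the report's highest-CFR table.
counties = {
    "Emmet, Michigan": (2, 7),
    "Grand Traverse, Michigan": (3, 12),
    "Toole, Montana": (3, 12),
}

cfr_by_county = {
    name: case_fatality_rate(deaths, confirmed)
    for name, (deaths, confirmed) in counties.items()
}

for name, cfr in cfr_by_county.items():
    print(f"{name}: {cfr:.3f}")
```

With so few confirmed cases per county, a single additional death changes the CFR substantially, so the highest-CFR counties should be read with caution.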

We begin with descriptive summaries of the COVID-19 and county health data. Specifically, we visualize confirmed cases and deaths with plots, and we summarize the demographic and health information of each county and state.

Data Cleaning and Preparation

This project involves two datasets: one contains COVID-19 cases and deaths, and the other contains health-related factors at the county level.

Before analysis, we clean the data and remove records with missing or erroneous information.
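A cleaning step of this kind might look like the following Python sketch. All records, field names, and validity rules here are hypothetical. The report does not state which checks were applied or how the two datasets were joined, so the join on `(county, state)` is an assumption.

```python
# Hypothetical records; the field names are assumptions, not the real column names.
covid_rows = [
    {"county": "A", "state": "X", "confirmed": 12, "deaths": 3},
    {"county": "B", "state": "X", "confirmed": None, "deaths": 1},  # missing count
    {"county": "C", "state": "Y", "confirmed": 5, "deaths": 9},     # deaths exceed cases
    {"county": "D", "state": "Y", "confirmed": 40, "deaths": 2},    # no health record
]
health_rows = [
    {"county": "A", "state": "X", "percent_smokers": 17.0},
    {"county": "B", "state": "X", "percent_smokers": 21.5},
    {"county": "C", "state": "Y", "percent_smokers": 19.2},
]


def is_valid(row: dict) -> bool:
    """Keep rows with complete, internally consistent counts."""
    confirmed, deaths = row["confirmed"], row["deaths"]
    return (
        confirmed is not None
        and deaths is not None
        and confirmed > 0
        and 0 <= deaths <= confirmed
    )


# Index the health data by (county, state) so that same-named counties
# in different states are not confused.
health_by_key = {(h["county"], h["state"]): h for h in health_rows}

merged = []
for row in covid_rows:
    key = (row["county"], row["state"])
    if is_valid(row) and key in health_by_key:
        merged.append({**row, **health_by_key[key]})

print(merged)
```

Keying on both county and state matters because many U.S. county names are shared across states.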

Descriptive Data Visualization

Graphical summary of COVID-19 confirmed cases and deaths by state on 04/04/2020.

Graphical summary of selected health indicators by U.S. county.

The 3 counties with the highest CFR (at least 25%) are:

## Selecting by cfr
## # A tibble: 3 x 5
##   county         state    confirmed deaths   cfr
##   <chr>          <fct>        <dbl>  <dbl> <dbl>
## 1 Emmet          Michigan         7      2 0.286
## 2 Grand Traverse Michigan        12      3 0.25 
## 3 Toole          Montana         12      3 0.25

Logistic Regression Prediction

One of our primary interests is to predict COVID-19 deaths and pinpoint the factors most relevant to them. Logistic regression is commonly used to predict a binomial outcome from a set of predictors. Here, the fitted probability of death among confirmed cases serves as the model-based estimate of the CFR.
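As a sketch of the model form, assume the regression is fit to grouped county counts. The symbols below are introduced here for illustration and are not from the original analysis: \(Y_i\) is the number of deaths, \(n_i\) the number of confirmed cases, and \(p_i\) the probability of death for a confirmed case in county \(i\), with \(x_{ij}\) the value of predictor \(j\).

\[ Y_i \sim \text{Binomial}(n_i, p_i), \qquad \log\frac{p_i}{1-p_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \]

Under this form, \(\exp(\beta_j)\) is the odds ratio for a one-unit increase in predictor \(j\), which is the quantity reported in the OR columns of the tables in this report.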

Logistic Regression Assumption Check

Before building the logistic model, we check several assumptions to make sure that logistic regression is appropriate for the data. Three are checked: linearity, influential values, and multicollinearity.


The assumptions hold reasonably well for our data. The first graph shows a linear relationship between the logit of the outcome and each predictor variable. The second graph shows that there may be some influential values (extreme values or outliers) in the continuous predictors. For example, New York has a relatively large number of deaths compared with other counties. After investigating these counties, we found no other apparent problems with them and decided to keep them in the analysis.
Furthermore, there are no high intercorrelations (i.e. multicollinearity) among the predictors. Therefore, we continue with logistic regression.
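The report does not say how multicollinearity was assessed. One common diagnostic is the variance inflation factor (VIF), sketched below in Python for the special case of two predictors, where VIF = 1 / (1 - r²) and r is their Pearson correlation. The function names and the data values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical county-level percentages, for illustration only.
percent_obesity = [28.0, 31.5, 35.2, 30.1, 33.8, 26.4]
percent_diabetes = [9.1, 10.0, 13.2, 12.5, 11.1, 10.4]

print(f"r   = {pearson_r(percent_obesity, percent_diabetes):.3f}")
print(f"VIF = {vif_two_predictors(percent_obesity, percent_diabetes):.3f}")
```

With more than two predictors, each VIF is computed from the R² of regressing one predictor on all the others. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.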

Characteristic                                 OR    95% CI      p-value
percent_asian                                  1.03  1.02, 1.03  <0.001
percent_less_than_18_years_of_age              0.94  0.93, 0.96  <0.001
percent_hispanic                               0.99  0.98, 0.99  <0.001
percent_fair_or_poor_health                    1.05  1.03, 1.07  <0.001
percent_adults_with_obesity                    1.03  1.02, 1.04  <0.001
(Intercept)                                    0.03  0.00, 0.21  <0.001
percent_insufficient_sleep                     0.99  0.98, 1.00  0.054
percent_65_and_over                            0.99  0.97, 1.00  0.060
percent_with_access_to_exercise_opportunities  1.00  1.00, 1.01  0.10
percent_smokers                                0.99  0.96, 1.01  0.2
percent_excessive_drinking                     1.01  0.99, 1.02  0.3
percent_adults_with_diabetes                   1.01  0.98, 1.03  0.5
percent_black                                  1.00  1.00, 1.00  0.6
percent_female                                 1.00  0.96, 1.03  0.9

Goodness of fit check

After obtaining the model, we would like to know whether it is adequate both for describing our data and for generalizing to other data. A goodness-of-fit check helps us understand how well the logistic model fits the data by comparing the deviance of (a) our model versus the saturated model and (b) our model versus the null (intercept-only) model.

## [1] 2.016365e-96
## [1] 4.686181e-79
  1. For the comparison with the saturated model, the goodness-of-fit test gives a very small p-value, so we reject \(H_0\) and conclude that our logistic model fits worse than the saturated model.
  2. For the comparison with the null model, the goodness-of-fit test gives a very small p-value, so we reject \(H_0\) and conclude that our logistic model fits better than the intercept-only model.
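The two comparisons above can be written as likelihood-ratio tests. The notation below is introduced here for illustration and is not from the original analysis: \(n\) is the number of counties, \(p\) the number of fitted parameters (including the intercept), and \(\ell_{\text{sat}}\), \(\ell_{\text{model}}\), \(\ell_{\text{null}}\) the maximized log-likelihoods of the three models.

\[ D_{\text{model}} = 2\,(\ell_{\text{sat}} - \ell_{\text{model}}) \;\overset{H_0}{\sim}\; \chi^2_{n-p}, \qquad D_{\text{null}} - D_{\text{model}} = 2\,(\ell_{\text{model}} - \ell_{\text{null}}) \;\overset{H_0}{\sim}\; \chi^2_{p-1} \]

A small p-value in the first test indicates lack of fit relative to the saturated model. A small p-value in the second indicates that the predictors jointly improve on the intercept-only model. Both chi-square reference distributions are large-sample approximations.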

In conclusion, our model explains some important features of the COVID-19 death rate, but there is room to improve its fit. Therefore, we try some model selection methods.

Based on the analysis of deviance, in which each row reports the effect of dropping one predictor from the full model, the 5 most significant predictors are:
Predictor                          Df  Deviance  AIC       LRT       Pr(>Chi)
percent_asian                      1   2912.111  4820.004  62.38003  0.0e+00
percent_less_than_18_years_of_age  1   2908.062  4815.956  58.33191  0.0e+00
percent_hispanic                   1   2889.345  4797.238  39.61404  0.0e+00
percent_fair_or_poor_health        1   2870.829  4778.723  21.09870  4.4e-06
percent_adults_with_obesity        1   2870.424  4778.317  20.69325  5.4e-06
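The p-values in the table above follow from the LRT column. For a test with 1 degree of freedom, the chi-square upper-tail probability has the closed form erfc(sqrt(LRT / 2)). The Python sketch below recomputes it from the reported LRT statistics; the function name is hypothetical, and the original analysis appears to have used R rather than this code.

```python
import math


def lrt_p_value_1df(lrt: float) -> float:
    """Upper-tail chi-square probability with 1 degree of freedom."""
    if lrt < 0:
        raise ValueError("the likelihood-ratio statistic cannot be negative")
    # If Z is standard normal, Z^2 is chi-square(1), so
    # P(Z^2 > lrt) = P(|Z| > sqrt(lrt)) = erfc(sqrt(lrt / 2)).
    return math.erfc(math.sqrt(lrt / 2.0))


# LRT statistics copied from the analysis-of-deviance table.
lrt_stats = {
    "percent_asian": 62.38003,
    "percent_less_than_18_years_of_age": 58.33191,
    "percent_hispanic": 39.61404,
    "percent_fair_or_poor_health": 21.09870,
    "percent_adults_with_obesity": 20.69325,
}

p_values = {name: lrt_p_value_1df(lrt) for name, lrt in lrt_stats.items()}

for name, p in p_values.items():
    print(f"{name}: {p:.1e}")
```

The first three p-values are far below the display precision of the table, which is why they appear there as 0.0e+00.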

Model Selection

Two methods with different approaches and focus are implemented: (a) the Akaike information criterion (AIC) and (b) lasso with cross-validation.

We first perform a sequential search using AIC.
The best sub-model given by the AIC criterion is:

Characteristic                                 OR¹   95% CI¹     p-value
(Intercept)                                    0.04  0.02, 0.06  <0.001
percent_asian                                  1.03  1.02, 1.03  <0.001
percent_less_than_18_years_of_age              0.94  0.93, 0.95  <0.001
percent_hispanic                               0.99  0.98, 0.99  <0.001
percent_fair_or_poor_health                    1.04  1.03, 1.05  <0.001
percent_adults_with_obesity                    1.03  1.02, 1.04  <0.001
percent_insufficient_sleep                     0.99  0.98, 1.00  0.007
percent_65_and_over                            0.99  0.98, 1.00  0.022
percent_with_access_to_exercise_opportunities  1.00  1.00, 1.01  0.11

¹ OR = Odds Ratio, CI = Confidence Interval
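The report does not state the direction of the sequential search. As one possibility, the Python sketch below shows backward elimination: starting from all predictors, it repeatedly drops the predictor whose removal lowers AIC the most, and stops when no removal helps. The function `backward_select`, the callable `aic_of`, and the toy AIC values are all hypothetical. In practice `aic_of` would refit the logistic model for each candidate subset.

```python
from typing import Callable, FrozenSet, Iterable


def backward_select(
    predictors: Iterable[str],
    aic_of: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Greedy backward elimination that minimizes AIC."""
    current = frozenset(predictors)
    current_aic = aic_of(current)
    while current:
        # Score every model obtained by dropping exactly one predictor.
        candidates = [(aic_of(current - {p}), p) for p in sorted(current)]
        best_aic, dropped = min(candidates)
        if best_aic >= current_aic:
            break  # no single removal improves AIC
        current, current_aic = current - {dropped}, best_aic
    return current


# Toy stand-in for model fitting: each predictor reduces a made-up deviance.
DEVIANCE_REDUCTION = {"x1": 30.0, "x2": 12.0}  # x3 and x4 contribute only 0.5 each


def toy_aic(subset: FrozenSet[str]) -> float:
    deviance = 100.0 - sum(DEVIANCE_REDUCTION.get(p, 0.5) for p in subset)
    n_params = len(subset) + 1  # predictors plus the intercept
    return deviance + 2 * n_params


selected = backward_select(["x1", "x2", "x3", "x4"], toy_aic)
print(sorted(selected))
```

A predictor stays in the model only if it reduces the deviance by more than the AIC penalty of 2 per parameter. This is why a predictor with a p-value above 0.05 can still be retained by AIC.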

Lasso with cross-validation (using the AUC criterion) retains all of the predictors, so its selected set of predictors matches that of the initial logistic model.
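For reference, lasso-penalized logistic regression can be sketched as the following optimization problem. The notation is introduced here for illustration: \(\ell(\beta)\) is the binomial log-likelihood, \(k\) the number of predictors, and \(\lambda \ge 0\) the penalty weight. The intercept \(\beta_0\) is assumed to be unpenalized, and software implementations may scale the likelihood term differently.

\[ \hat{\beta}(\lambda) = \arg\min_{\beta}\; \Big\{ -\ell(\beta) + \lambda \sum_{j=1}^{k} |\beta_j| \Big\} \]

Cross-validation chooses \(\lambda\), here by maximizing the AUC on held-out folds. When the chosen \(\lambda\) is small, no coefficient is shrunk exactly to zero, which is consistent with all predictors being retained.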

For comparison with the AIC model, the initial logistic model is shown again:

Characteristic                                 OR¹   95% CI¹     p-value
percent_asian                                  1.03  1.02, 1.03  <0.001
percent_less_than_18_years_of_age              0.94  0.93, 0.96  <0.001
percent_hispanic                               0.99  0.98, 0.99  <0.001
percent_fair_or_poor_health                    1.05  1.03, 1.07  <0.001
percent_adults_with_obesity                    1.03  1.02, 1.04  <0.001
(Intercept)                                    0.03  0.00, 0.21  <0.001
percent_insufficient_sleep                     0.99  0.98, 1.00  0.054
percent_65_and_over                            0.99  0.97, 1.00  0.060
percent_with_access_to_exercise_opportunities  1.00  1.00, 1.01  0.10
percent_smokers                                0.99  0.96, 1.01  0.2
percent_excessive_drinking                     1.01  0.99, 1.02  0.3
percent_adults_with_diabetes                   1.01  0.98, 1.03  0.5
percent_black                                  1.00  1.00, 1.00  0.6
percent_female                                 1.00  0.96, 1.03  0.9

¹ OR = Odds Ratio, CI = Confidence Interval

While each model emphasizes a different aspect of the data, several predictors are consistently significant across all models.

Interpretations

According to the models, the following variables are the most significant county-level predictors of COVID-19 CFR.

  • For a one-percentage-point increase in the Asian share of the population, the odds of dying from COVID-19 are multiplied by 1.03, i.e. increase by 3%, holding all other features constant.
  • Similarly, for a one-percentage-point increase in the share of residents under 18 years old, the odds of dying from COVID-19 decrease by 6%.
  • For a one-percentage-point increase in the Hispanic share of the population, the odds of dying from COVID-19 decrease by 1%.
  • For a one-percentage-point increase in the share of people in fair or poor health, the odds of dying from COVID-19 increase by 5%.
  • For a one-percentage-point increase in the share of adults with obesity, the odds of dying from COVID-19 increase by 3%.
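The percentages in the list above come from the odds ratios as (OR - 1) × 100. A minimal Python sketch of that conversion follows. The function name and the `units` parameter are assumptions made for illustration, and the odds ratios are copied from the initial logistic model table.

```python
def percent_change_in_odds(odds_ratio: float, units: float = 1.0) -> float:
    """Percent change in the odds for an increase of `units` in a predictor."""
    # Odds ratios multiply, so a change of `units` scales the odds by OR ** units.
    return (odds_ratio ** units - 1.0) * 100.0


# Odds ratios copied from the initial logistic model table.
odds_ratios = {
    "percent_asian": 1.03,
    "percent_less_than_18_years_of_age": 0.94,
    "percent_hispanic": 0.99,
    "percent_fair_or_poor_health": 1.05,
    "percent_adults_with_obesity": 1.03,
}

for name, odds_ratio in odds_ratios.items():
    change = percent_change_in_odds(odds_ratio)
    print(f"{name}: {change:+.0f}% per percentage point")
```

Because odds ratios compound multiplicatively, a 10-percentage-point increase in a predictor with OR = 1.03 raises the odds by about 34%, not 30%.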

Potential Limitations

The potential limitations of the study are discussed from 3 aspects:

  1. CFR measurement
    Again, CFR is defined as: \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. \] There are some limitations to using CFR to measure the COVID-19 death rate. CFR assumes that the reported deaths and confirmed cases reflect the true burden of the disease. COVID-19 has a prolonged progression and lasts longer than many other acute diseases. During the long interval between diagnosis and death, a patient may die from another cause but still be counted as a COVID-19 death, which leads to an overestimated CFR. It would be more accurate to account for this time frame in the calculation. Conversely, when people die from COVID-19 before being confirmed and recorded, the CFR is underestimated.

  2. Unselected predictors
    Other county-level variables, such as education and socioeconomic status (SES), have not been included in the analysis. Future analyses may want to include these variables and control for their effects.

  3. Logistic assumption violation
    The identical-distribution and independence assumptions of logistic regression may be violated. In this dataset, people are grouped into clusters (counties), which can lead to undetected heterogeneity and violate the identical-distribution assumption, e.g. people in certain places may be more easily affected by COVID-19. There may also be undetected dependence between trials. For example, the death of some patients may increase the healthcare resources available to others, so CFRs are correlated. This violates the independence assumption. Both types of violation can inflate the variance beyond what the binomial model assumes (overdispersion).

Further studies are encouraged to address these limitations as more COVID-19 data become available.